14 ◾ Bioinformatics
The fourth line of the FASTQ file contains the ASCII-coded string that represents the
per-base Phred quality scores. The numeric value of each ASCII character corresponds to
the quality score of the base at the same position in the sequence line.
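The mapping from quality characters to scores can be sketched in a few lines of Python. This is a minimal illustration that assumes the common Phred+33 encoding, in which the score is the ASCII code minus 33; the function names `phred_scores` and `error_probability` and the sample quality string are hypothetical and chosen only for this example.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores.

    Assumes the Phred+33 encoding (offset=33); older files may use a
    different offset, so check the encoding of your own data first.
    """
    return [ord(ch) - offset for ch in quality_line.strip()]


def error_probability(q):
    """Return the base-calling error probability for Phred score q."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a five-base read.
scores = phred_scores("II5+!")
print(scores)  # [40, 40, 20, 10, 0]
```

Under this assumption, a score of 20 corresponds to an error probability of 0.01, that is, one expected miscall in 100 bases.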
Researchers usually acquire raw sequencing data for their own research from a sequencing
instrument. Raw sequencing data can also be downloaded from a database where scientists
and research institutions deposit their raw data and make it available to the public. In
either case, the raw sequencing data is usually obtained as FASTQ files. The NCBI SRA
database is one of the largest databases of raw data for hundreds of species. The FASTQ
files are stored in Sequence Read Archive (SRA) format, and they can be downloaded and
extracted using the SRA Toolkit [9], a collection of programs developed by the NCBI that
can be downloaded and installed by following the instructions available at
“https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi”.
For the purpose of demonstration, we will download raw data from the NCBI SRA database.
We will use a single-end FASTQ file with the run ID “SRR030834”, whose size is 3.5 GB.
The FASTQ file contains reads sequenced from an ancient hair tuft of a 4000-year-old
male individual from the extinct Saqqaq Palaeo-Eskimo culture, excavated directly from
culturally deposited permafrozen sediments at Qeqertasussuk, Greenland. To keep files
organized, you can create the directory “fastqs” and then download the FASTQ file using
“fasterq-dump”
TABLE 1.3 Illumina FASTQ Identifier Line Elements [8]

Identifier Line Element   Description
@                         The beginning of the read identifier line
<instrument>              Instrument ID or sequence ID
<run num>                 The number of the run on the instrument
<flowcell ID>             The flow cell ID
<lane>                    The number of the lane where the read was sequenced
<tile>                    The number of the tile where the read was sequenced
<x>                       The X-coordinate of the DNA cluster
<y>                       The Y-coordinate of the DNA cluster
<UMI>                     Present only if a unique molecular identifier (UMI) is used
<read>                    The read number (1 for a single read or the first read of a pair, 2 for the second read of a pair)
<filtered>                Y if the read was filtered out (did not pass the filter) and N otherwise
<control num>             0 (none of the control bits are on) or an even number
<index>                   The sample number or read index
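To make the elements in Table 1.3 concrete, the following Python sketch splits an identifier line into named fields. It assumes the usual Illumina layout, in which the first group of elements is colon-separated, followed by a single space and then `<read>:<filtered>:<control num>:<index>`; the table itself does not state the separators, so this layout is an assumption. The class `IlluminaReadId`, the function `parse_illumina_id`, and the sample identifier line are hypothetical names and values used only for illustration.

```python
from typing import NamedTuple, Optional


class IlluminaReadId(NamedTuple):
    """Fields of an Illumina FASTQ identifier line, named after Table 1.3."""
    instrument: str
    run_num: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    umi: Optional[str]   # None when no UMI element is present
    read: int
    filtered: str        # the Y/N flag, kept as text
    control_num: int
    index: str


def parse_illumina_id(line):
    """Parse one identifier line, assuming the colon/space layout above."""
    line = line.strip()
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # Split the two groups at the first space.
    left, _, right = line[1:].partition(" ")
    lf = left.split(":")
    rf = right.split(":")
    if len(lf) not in (7, 8) or len(rf) != 4:
        raise ValueError("unexpected number of identifier elements")
    umi = lf[7] if len(lf) == 8 else None
    return IlluminaReadId(
        lf[0], int(lf[1]), lf[2], int(lf[3]), int(lf[4]),
        int(lf[5]), int(lf[6]), umi,
        int(rf[0]), rf[1], int(rf[2]), rf[3],
    )


# Hypothetical identifier line without a UMI element.
rec = parse_illumina_id("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(rec.flowcell_id, rec.lane, rec.tile, rec.read)  # FCX 1 15 1
```

Identifier lines in files downloaded from public archives may have been rewritten and may not follow this layout, in which case the parser raises a `ValueError` rather than guessing.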
FIGURE 1.6 A FASTQ file format showing three records.